11/05/2019

Agenda

  • Regression trees
  • Classification trees
  • Bagging and Random Forests
  • Variable Importance measures
  • Boosting trees

Introduction

  • Here we describe tree-based methods for regression and classification
  • These involve stratifying or segmenting the predictor space into a number of simple regions
  • Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision-tree methods

\[\widehat{Y} = \widehat{f}(X) = \text{Tree}(X)\]

Decision trees for regression

One dimensional example

At each internal node there is a decision rule of the form \(\{x < c\}\): if \(x < c\) go left, otherwise go right

  • Every rule adds a split in the predictor space
  • The set of leaves (bottom nodes) gives us a partition of the predictor space \(\mathcal{X}\) into disjoint regions (with just one \(X\), it is a set of intervals)

One dimensional example

Within each region (interval) we compute the average of the \(y\) values for the subset of training data in the region

  • This gives us an estimated \(\hat{f}\) (step function)
  • To predict, we drop \(x\) down the tree until it lands in a leaf and then predict the average of the \(y\) values for the training observations in the same leaf (a sketch of this procedure is shown below)
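A minimal sketch of this fit-and-predict procedure, using scikit-learn on simulated one-dimensional data (the slides do not show code, so the library and the data here are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Simulated one-dimensional training data (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)

# Grow a small tree: each leaf predicts the mean of the y values it contains
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(x, y)

# Prediction: drop each new x down the tree to a leaf and return that leaf's mean
x_new = np.array([[2.5], [7.0]])
print(tree.predict(x_new))  # a step function of x
```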

Two dimensional example

Two dimensional example

Now the decision rules can use either of the two \(x\)’s.

Two dimensional example

  • Regression function given by the tree
  • It is a “step-like” function
  • It delivers non-linearity and interactions in a simple way
  • It works with a large number of predictor variables (a two-predictor fit is sketched below)
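As a sketch of the two-predictor case (on simulated data rather than the dataset in the slides), a single tree picks up an interaction between \(x_1\) and \(x_2\) without us having to specify it:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Simulated data in which the effect of x2 depends on x1 (an interaction)
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(500, 2))
y = np.where(X[:, 0] > 0.5, 2.0 * X[:, 1], -1.0 * X[:, 1]) + rng.normal(scale=0.1, size=500)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))  # the splits use either predictor
```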

Interpretation of results

  • lstat (\(\%\) lower status of the population) is the most important factor in determining medv (median value of houses): houses in neighbourhoods with lower lstat have higher values than houses in neighbourhoods with larger lstat
  • Given that a house is in a neighbourhood with large lstat, the distance to employment centres seems to play little role in the house value


However, among neighbourhoods with smaller lstat, the distance to employment centres does affect the house value, and houses that are closer to employment centres have higher values
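A sketch of how such a tree can be inspected, assuming a pandas DataFrame named `boston` with columns `lstat`, `dis` and `medv` has already been loaded (that loading step, and the exact tree from the slides, are not reproduced here):

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

# `boston` is assumed to be a pre-loaded DataFrame with columns lstat, dis, medv
X = boston[["lstat", "dis"]]
y = boston["medv"]

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# The plotted splits make the interpretation above visible
plot_tree(tree, feature_names=["lstat", "dis"], filled=True)
plt.show()
```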

Shallow vs deep trees

Let’s see a deeper version of the previous tree.

Shallow vs deep trees

Notice the interaction: the effect of dis depends on lstat!

Details of the tree building process

  • We divide the predictor space - that is, the set of possible values for \(X_1, X_2, \dots, X_p\) - into \(J\) distinct and non-overlapping regions \(R_1, R_2, \dots, R_J\)
  • For every observation that falls into the region \(R_j\), we make the same prediction \(\hat{y}_{R_j}\), which is simply the mean of the response values for the training observations in \(R_j\)
  • In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model
  • The goal is to find boxes \(R_1, \dots, R_J\) that minimize the RSS, given by \[\sum_{j=1}^J \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2\] (a small numerical check of this criterion is sketched below)
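A small numerical check of the RSS criterion on an assumed toy partition: each region contributes the squared deviations of its \(y\) values from the region mean.

```python
import numpy as np

def partition_rss(y_by_region):
    """RSS of a partition: sum over regions of squared deviations from the region mean."""
    return sum(np.sum((y - y.mean()) ** 2) for y in y_by_region)

# Toy partition with two regions (values chosen purely for illustration)
R1 = np.array([1.0, 2.0, 3.0])   # mean 2.0 -> contributes 2.0
R2 = np.array([10.0, 12.0])      # mean 11.0 -> contributes 2.0
print(partition_rss([R1, R2]))   # 4.0
```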

Details of the tree building process

  • Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into \(J\) boxes
  • For this reason, we take a top-down, greedy approach that is known as recursive binary splitting
  • The approach is top-down because it begins at the top of the tree and then successively splits the predictor space; each split is indicated via two new branches further down on the tree
  • It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step

Details of the tree building process

  • We first select the predictor \(X_j\) and the cutpoint \(s\) such that splitting the predictor space into the regions \(\{X \mid X_j < s\}\) and \(\{X \mid X_j \geq s\}\) leads to the greatest possible reduction in RSS
  • Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions
  • However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions. We now have three regions
  • Again, we look to split one of these three regions further, so as to minimize the RSS. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations (one step of this greedy search is sketched below)
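A minimal sketch of a single step of recursive binary splitting: scan every predictor \(j\) and candidate cutpoint \(s\), and keep the pair that most reduces the RSS (the data here are simulated and purely illustrative).

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the (predictor, cutpoint) pair that minimizes the post-split RSS."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best  # (predictor index, cutpoint, RSS after the split)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] < 0.3, 1.0, 4.0) + rng.normal(scale=0.2, size=100)
print(best_split(X, y))  # should pick predictor 0 with a cutpoint near 0.3
```

A full tree is grown by applying the same search recursively inside each resulting region until the stopping rule is met.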

Two dimensional example

Two dimensional example

Two dimensional example

Two dimensional example

Pruning a tree

  • The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance. Why?
  • A smaller tree with fewer splits (that is, fewer regions \(R_1, \dots, R_J\)) might lead to lower variance and better interpretation at the cost of a little bias
  • One possible alternative to the process described above is to grow the tree only so long as the decrease in the RSS due to each split exceeds some (high) threshold
  • This strategy will result in smaller trees, but it is too short-sighted: a seemingly worthless split early on in the tree might be followed by a very good split - that is, a split that leads to a large reduction in RSS later on

Pruning a tree

  • A better strategy is to grow a very large tree \(T_0\), and then prune it back in order to obtain a subtree
  • We consider a sequence of trees indexed by a nonnegative tuning parameter \(\alpha\). For each value of \(\alpha\) there corresponds a subtree \(T \subset T_0\) such that \[C_{\alpha}(T) = \sum_{m=1}^{|T|} \sum_{i \text{ s.t. } x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|\] is as small as possible. Here \(|T|\) indicates the number of terminal nodes of the tree \(T\)
  • Given a current pruned tree, examine every pair of bottom nodes (having the same parent node) and consider eliminating the pair. Prune the pair that gives the biggest decrease in our criterion \(C_{\alpha}(T)\). This gives us a sequence of subtrees of our initial big tree

Pruning a tree

  • The tuning parameter \(\alpha\) controls a trade-off between the subtree’s complexity and its fit to the training data (it should remind you of \(\lambda\) in the lasso)
  • We select an optimal value \(\hat{\alpha}\) using cross-validation
  • We then return to the full data set and obtain the subtree corresponding to \(\hat{\alpha}\) (this workflow is sketched below)
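A sketch of this workflow with scikit-learn, whose `ccp_alpha` parameter plays the role of \(\alpha\): compute the pruning path of a large tree \(T_0\), choose \(\hat{\alpha}\) by cross-validation, then refit on the full data (simulated data, for illustration only).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

# Candidate alphas come from the cost-complexity pruning path of the large tree T0
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas

# Choose alpha-hat by cross-validation, then refit on the full data set
scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
          for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
final_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, final_tree.get_n_leaves())
```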

Bias variance tradeoff

As we saw, the key idea is that a complex tree is a big tree. We usually measure the complexity of the tree by the number of leaves.
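One way to see the tradeoff, sketched below on simulated data: as the number of leaves grows, the training error keeps falling while the test error eventually turns back up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.4, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for leaves in [2, 4, 8, 16, 32, 64, 128]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves).fit(X_tr, y_tr)
    print(leaves,
          round(mean_squared_error(y_tr, tree.predict(X_tr)), 3),  # training MSE: keeps decreasing
          round(mean_squared_error(y_te, tree.predict(X_te)), 3))  # test MSE: eventually increases
```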

Classification trees

Classification trees

  • Very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one
  • For a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs
  • Just as in the regression setting, we use recursive binary splitting to grow a classification tree, and the same cost-complexity criterion \(C_{\alpha}(T)\) can be used for pruning, with a classification error measure in place of the RSS
  • In the classification setting, the RSS cannot be used as a criterion for making the binary splits: a natural candidate is the misclassification rate

Classification trees

However, in practice two other measures are preferable: the Gini index and the entropy.

The Gini index is defined by \[G_m = \sum_{k=1}^K \hat{p}_{mk} (1 - \hat{p}_{mk}),\] where \(\hat{p}_{mk}\) represents the proportion of observations in the \(m^{th}\) region that are from the \(k^{th}\) class. It is a measure of total variance across the \(K\) classes.

An alternative to the Gini index is entropy, given by \[D_m = - \sum_{k=1}^K \hat{p}_{mk} \log(\hat{p}_{mk})\] Like the Gini index, the entropy is small when the node is nearly pure (a small numerical comparison is sketched below).
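A small numerical illustration of the two measures, for an assumed very mixed node and an assumed nearly pure node:

```python
import numpy as np

def gini(p):
    p = np.asarray(p)
    return float(np.sum(p * (1 - p)))

def entropy(p):
    p = np.asarray(p)
    return float(-np.sum(p * np.log(p)))  # natural log; base 2 is also common

p_mixed = [1/3, 1/3, 1/3]     # very mixed node
p_pure  = [0.9, 0.05, 0.05]   # nearly pure node
print(gini(p_mixed), entropy(p_mixed))  # large values: about 0.67 and 1.10
print(gini(p_pure), entropy(p_pure))    # small values: about 0.19 and 0.39
```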

Example

Hockey penalty data: the response is whether or not the next penalty is on the other team and \(x\) contains a bunch of variables about the current game (score, …).

Example

The decision rule cannot take the form \(\{x < c\}\) for a categorical variable: instead, the tree picks a subset of the levels to send left. inrow2 = 0 means that all the observations with inrow2 in the category labeled 0 go left.

If:

  • you are not winning
  • you had the last two penalties
  • it has not been long since the last call
  • and there is only 1 referee

then there is a \(72\%\) chance the next call will be on the other team.

In contrast, there is another game situation where the chance the next call is on the other team is only \(41\%\).

Recap

Pros and cons

Good:

  • Flexible fitters, capture non-linearity and interactions
  • The scale of variables does not matter
  • Handles both categorical and numeric \(y\) and \(X\) very nicely
  • Simple interpretation (if the tree is small)

Bad:

  • Not the best in out-of-sample predictive performance


  • GAM: complexity given by nonlinearity, interpretability comes from additivity.
  • Trees: complexity given by interactions, interpretability comes from simplicity of the fitted functions.

Sacrificing interpretation for performance

If we bag or boost trees, we can get the best off-the-shelf prediction available.

The idea is to combine a large number (hundreds, thousands) of trees to get an overall predictor.

Ensemble methods: we can improve predictive accuracy at the expense of some loss of interpretation.

Details in the next class…
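As a quick preview, a hedged sketch of the idea with scikit-learn's random forest (simulated data; the comparison is illustrative, not a result from the slides):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 3))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=400)

single = cross_val_score(DecisionTreeRegressor(random_state=0), X, y, cv=5).mean()
forest = cross_val_score(RandomForestRegressor(n_estimators=200, random_state=0), X, y, cv=5).mean()
print(single, forest)  # the ensemble of many trees typically scores better out of sample
```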

Test

Which tree has larger variance?

Test

What would you predict for the value of a house with lstat = 6 and dis = 3?

Question time